Instructions¶
Labeling & Peer Grading: Your homework will be peer graded. To stay anonymous, avoid using your name and label your file with the last four digits of your student ID (e.g., HW#_Solutions_3938).
Submission: Submit both your IPython notebook (.ipynb) and an HTML file of the notebook to Canvas under Assignments → HW # → Submit Assignment. After submitting, download and check the files to make sure that you've uploaded the correct versions. Both files are required for your HW to be graded.
No pdf file required so write all the details in your ipynb file.
AI Use Policy: Solve each problem independently by yourself. Use AI tools like ChatGPT or Google Gemini for brainstorming and learning only—copying AI-generated content is prohibited. You do not neeViolations will lead to penalties, up to failing the course.
Problem Structure: Break down each problem ( already done in most problems) into three interconnected parts and implement each in separate code cells. Ensure that each part logically builds on the previous one. Include comments in your code to explain its purpose, followed by a Markdown cell analyzing what was achieved. After completing all parts, add a final Markdown cell reflecting on your overall approach, discussing any challenges faced, and explaining how you utilized AI tools in your process.
Deadlines & Academic Integrity: This homework is due on 10/01/2024 at midnight. Disclosure of this assignment and assignment answers to anyone or any website is a contributory infringement of academic dishonesty at ISU. Do not share or post course materials without the express written consent of the copyright holder and instructor. The class will follow Iowa State University’s policy on academic dishonesty. Anyone suspected of academic dishonesty will be reported to the Dean of Students Office.
Each problem is worth 25 points. Total $\bf 25\times 4 = 100$.¶
Problem 1.¶
Upload the textdata.csv and preprocess the text excerpts in the text column.
- Find the various numerical information related to these text excerpts and add them to the textdata.csv as new columns with appropriate labels. The target variable "Bradley-Terry_Score"(https://en.wikipedia.org/wiki/Bradley%E2%80%93Terry_model) is related to the readability of the text excerpt; use the following links to learn more about various other scores and create new columns ( at least 10) with each text excerpts respective score and add the columns in the data. More information on text statistics are in https://pypi.org/project/textatistic and https://pypi.org/project/textstat/
- Perform feature selection using methods such as correlation analysis, Recursive Feature Elimination (RFE), SelectKBest, or other relevant techniques, considering Bradley_Terry_Score as the target. Display a correlation heat map of the selected features and the target variable.
- Create multiple linear regression models using Bradley_Terry_Score as the target variable, testing with three different test set sizes: 20%, 25%, and 30%. Cross-validate all models and summarize the test set metrics, including Mean Absolute Deviation (MAD), and R-squared (R²) in a table to identify the best model. Assess the suitability of developing a regression model for this problem, and provide your rationale based on the data and analysis results.
import warnings
warnings.filterwarnings('ignore')
# Upload data
import pandas as pd
import numpy as np
textdata = pd.read_csv("textdata.csv")
textdata.head(10)
| textid | text | Bradly_Terry_Score | |
|---|---|---|---|
| 0 | c12129c31 | When the young people returned to the ballroom... | -0.340259 |
| 1 | 85aa80a4c | All through dinner time, Mrs. Fayre was somewh... | -0.315372 |
| 2 | b69ac6792 | As Roger had predicted, the snow departed as q... | -0.580118 |
| 3 | dd1000b26 | And outside before the palace a great garden w... | -1.054013 |
| 4 | 37c1b32fb | Once upon a time there were Three Bears who li... | 0.247197 |
| 5 | f9bf357fe | Hal and Chester found ample time to take an in... | -0.861809 |
| 6 | eaf8e7355 | Hal Paine and Chester Crawford were typical Am... | -1.759061 |
| 7 | 0a43a07f1 | On the twenty-second of February, 1916, an aut... | -0.952325 |
| 8 | f7eff7419 | The boys left the capitol and made their way d... | -0.371641 |
| 9 | d96e6dbcd | One day he had gone beyond any point which he ... | -1.238432 |
textdata.text[0]
'When the young people returned to the ballroom, it presented a decidedly changed appearance. Instead of an interior scene, it was a winter landscape.\nThe floor was covered with snow-white canvas, not laid on smoothly, but rumpled over bumps and hillocks, like a real snow field. The numerous palms and evergreens that had decorated the room, were powdered with flour and strewn with tufts of cotton, like snow. Also diamond dust had been lightly sprinkled on them, and glittering crystal icicles hung from the branches.\nAt each end of the room, on the wall, hung a beautiful bear-skin rug.\nThese rugs were for prizes, one for the girls and one for the boys. And this was the game.\nThe girls were gathered at one end of the room and the boys at the other, and one end was called the North Pole, and the other the South Pole. Each player was given a small flag which they were to plant on reaching the Pole.\nThis would have been an easy matter, but each traveller was obliged to wear snowshoes.'
# Use count vectorizer to create numerical features and make a dafatrame
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer(stop_words='english').fit(list(textdata.text))
trainedtext1 = vect.transform(list(textdata.text))
colnames1 = vect.get_feature_names_out()
nparray1 = trainedtext1.toarray()
df1 = pd.DataFrame(nparray1, columns = colnames1)
df1.head(2)
| 00 | 000 | 000th | 001 | 02 | 03 | 034 | 04 | 049 | 06 | ... | µv | ½d | ædui | ægidus | æmilius | æneas | æolian | æquians | æschylus | ça | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 rows × 26526 columns
# Use tfidf vectorizer to create numerical features and make a dafatrame
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vect = TfidfVectorizer(stop_words='english').fit(list(textdata.text))
trainedtext2 = tfidf_vect.transform(list(textdata.text))
# As the names are same as count vectorizor, just added "A" to make the
#feature names different
colnames2 = ["A" + item for item in tfidf_vect.get_feature_names_out()]
nparray2 = trainedtext2.toarray()
df2 = pd.DataFrame(nparray2, columns = colnames2)
df2.head(2)
| A00 | A000 | A000th | A001 | A02 | A03 | A034 | A04 | A049 | A06 | ... | Aµv | A½d | Aædui | Aægidus | Aæmilius | Aæneas | Aæolian | Aæquians | Aæschylus | Aça | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
2 rows × 26526 columns
# Combines to data frames to one by cancatenating the columns.
df = pd.concat([df1,df2], axis = 1)
df.shape
(2834, 53052)
!pip install textstat
Requirement already satisfied: textstat in /opt/anaconda3/lib/python3.12/site-packages (0.7.4) Requirement already satisfied: pyphen in /opt/anaconda3/lib/python3.12/site-packages (from textstat) (0.16.0) Requirement already satisfied: setuptools in /opt/anaconda3/lib/python3.12/site-packages (from textstat) (74.1.2)
import textstat
from textblob import TextBlob
def find_various_scores(text_data):
# Readability scores
fre = textstat.flesch_reading_ease(text_data)
fkg = textstat.flesch_kincaid_grade(text_data)
smog = textstat.smog_index(text_data)
cli = textstat.coleman_liau_index(text_data)
ari = textstat.automated_readability_index(text_data)
dcr = textstat.dale_chall_readability_score(text_data)
dw = textstat.difficult_words(text_data)
lwf = textstat.linsear_write_formula(text_data)
gf = textstat.gunning_fog(text_data)
fh = textstat.fernandez_huerta(text_data)
sp = textstat.szigriszt_pazos(text_data)
gp = textstat.gutierrez_polini(text_data)
craw = textstat.crawford(text_data)
gi = textstat.gulpease_index(text_data)
osman = textstat.osman(text_data)
syllable_count = textstat.syllable_count(text_data)
character_count = textstat.char_count(text_data)
word_count = textstat.lexicon_count(text_data, removepunct=True)
sentence_count = textstat.sentence_count(text_data)
words = text_data.split()
lexical_density = len([word for word in words if word.isalpha()]) / len(words)
ttr = len(set(words)) / len(words)
hapax_legomena = len([word for word in set(words) if words.count(word) == 1])
sentences = text_data.split('.')
avg_sentence_length = sum(len(sentence.split()) for sentence in sentences) / len(sentences)
complex_word_count = len([word for word in words if textstat.syllable_count(word) >= 3])
sentiment = TextBlob(text_data).sentiment
polarity = sentiment.polarity
subjectivity = sentiment.subjectivity
scores = [fre, fkg, smog, cli, ari, dcr, dw, lwf, gf, fh, sp, gp, craw, gi, osman,
syllable_count, character_count, word_count, sentence_count, lexical_density,
ttr, hapax_legomena, avg_sentence_length, complex_word_count, polarity, subjectivity]
return scores
# Create a new data frame of reading scores.
score_labels = ["fre", "fkg", "smog", "cli", "ari", "dcr", "dw", "lwf", "gf", "fh", "sp", "gp", "craw", "gi",
"osman", "syllable_count", "character_count", "word_count", "sentence_count", "lexical_density",
"ttr", "hapax_legomena", "avg_sentence_length", "complex_word_count", "polarity", "subjectivity"]
scores = [find_various_scores(textdata.text[i]) for i in range(len(textdata))]
scoresdf = pd.DataFrame(scores, columns = score_labels)
scoresdf.head(2)
| fre | fkg | smog | cli | ari | dcr | dw | lwf | gf | fh | ... | character_count | word_count | sentence_count | lexical_density | ttr | hapax_legomena | avg_sentence_length | complex_word_count | polarity | subjectivity | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 80.31 | 6.1 | 8.6 | 7.94 | 8.1 | 7.80 | 25 | 9.00 | 8.31 | 112.21 | ... | 814 | 179 | 11 | 0.849162 | 0.636872 | 89 | 14.916667 | 10 | 0.134848 | 0.525758 |
| 1 | 84.57 | 4.5 | 8.0 | 6.31 | 6.0 | 6.39 | 17 | 6.25 | 6.73 | 116.50 | ... | 769 | 169 | 14 | 0.727811 | 0.751479 | 107 | 15.545455 | 10 | 0.133999 | 0.566643 |
2 rows × 26 columns
# Finally combined all the data sets with features.
finaldf = pd.concat([df,scoresdf], axis = 1)
finaldf.shape
(2834, 53078)
## We can add the target bariable to data but not required.
combined_df = finaldf
combined_df["Bradly_Terry_Score"] = textdata["Bradly_Terry_Score"]
# Quickly run a regression model to see what we are doing.
from sklearn.linear_model import LinearRegression
import sklearn.metrics as metrics
y = textdata.Bradly_Terry_Score
RLR = LinearRegression().fit(finaldf, y)
ypred = RLR.predict(finaldf)
metrics.r2_score(y, ypred)
1.0
# Filter the features with the correlation of more than 0.25 in absolute value
corrs = finaldf.corrwith(textdata['Bradly_Terry_Score']).abs().to_frame().reset_index()
corrs.columns = ["feature", "correlation"]
# Only significantly correlated with the absolute correlation of 0.25 or more
corrs = corrs[corrs["correlation"] >= 0.25]
corrs.shape
(23, 2)
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize = (20,16))
highly_correlated = combined_df[list(corrs.feature)]
sns.heatmap(round(highly_correlated.corr(),2), cmap="Reds", annot=True)
<Axes: >
As demonstrated in the analysis, our model is experiencing overfitting. I experimented with various values of k, ranging from 50 to 300, and even tried larger values like 500 and 1000. However, those higher values of k led to significant overfitting.¶
For each value of k, I calculated key performance metrics, including R² and MAE for both the training and test datasets. Additionally, I computed the absolute difference between the training and testing scores (R² and MAE) to assess the model’s generalization performance.¶
These metrics were plotted against k, and from the results, it is evident that k = 100 appears to be the optimal choice from the range of values tested. At this point, the model strikes a good balance between underfitting and overfitting, as indicated by the minimal gap between training and testing scores.¶
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, f_regression
results = []
y = textdata.Bradly_Terry_Score
# Loop over values of k from 50 to 300 with increments of 10
for k in range(50, 301, 10):
# Select K best features based on f_regression
selector = SelectKBest(f_regression, k=k)
X_new = selector.fit_transform(finaldf, y)
scores = selector.scores_
feature_names = finaldf.columns
selected_features = sorted(zip(scores, feature_names), reverse=True)[:k]
selected_features = pd.DataFrame(selected_features, columns=["fscore", 'feature'])
x = finaldf[list(selected_features.feature)]
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.20, random_state=99)
rlr = LinearRegression().fit(xtrain, ytrain)
rlrtrainpred = rlr.predict(xtrain)
rlrtestpred = rlr.predict(xtest)
rlrtrain_r2 = r2_score(ytrain, rlrtrainpred)
rlrtest_r2 = r2_score(ytest, rlrtestpred)
rlrtrain_mae = mean_absolute_error(ytrain, rlrtrainpred)
rlrtest_mae = mean_absolute_error(ytest, rlrtestpred)
r2_diff = abs(rlrtrain_r2 - rlrtest_r2)
mae_diff = abs(rlrtrain_mae - rlrtest_mae)
results.append([k, rlrtrain_r2, rlrtest_r2, rlrtrain_mae, rlrtest_mae, r2_diff, mae_diff])
results_df = pd.DataFrame(results, columns=["k", "Training R^2", "Testing R^2", "Training MAE",
"Testing MAE", "R^2 Diff", "MAE Diff"])
results_df
| k | Training R^2 | Testing R^2 | Training MAE | Testing MAE | R^2 Diff | MAE Diff | |
|---|---|---|---|---|---|---|---|
| 0 | 50 | 0.473059 | 0.428579 | 0.595679 | 0.631539 | 0.044481 | 0.035859 |
| 1 | 60 | 0.478712 | 0.432900 | 0.591193 | 0.631176 | 0.045812 | 0.039983 |
| 2 | 70 | 0.483812 | 0.434649 | 0.588988 | 0.628207 | 0.049164 | 0.039219 |
| 3 | 80 | 0.491386 | 0.438847 | 0.585684 | 0.624213 | 0.052539 | 0.038529 |
| 4 | 90 | 0.497236 | 0.455822 | 0.582538 | 0.612134 | 0.041414 | 0.029597 |
| 5 | 100 | 0.499460 | 0.459576 | 0.581504 | 0.608413 | 0.039884 | 0.026910 |
| 6 | 110 | 0.506515 | 0.462360 | 0.578041 | 0.607275 | 0.044156 | 0.029234 |
| 7 | 120 | 0.513370 | 0.464354 | 0.573091 | 0.603305 | 0.049016 | 0.030215 |
| 8 | 130 | 0.518394 | 0.467688 | 0.570319 | 0.602883 | 0.050707 | 0.032564 |
| 9 | 140 | 0.524655 | 0.465978 | 0.566429 | 0.603145 | 0.058677 | 0.036717 |
| 10 | 150 | 0.526992 | 0.474836 | 0.564646 | 0.598683 | 0.052156 | 0.034037 |
| 11 | 160 | 0.533485 | 0.477909 | 0.558736 | 0.594478 | 0.055575 | 0.035742 |
| 12 | 170 | 0.539114 | 0.482567 | 0.555165 | 0.591401 | 0.056547 | 0.036235 |
| 13 | 180 | 0.544298 | 0.482961 | 0.551881 | 0.589890 | 0.061337 | 0.038009 |
| 14 | 190 | 0.548352 | 0.481771 | 0.549274 | 0.590751 | 0.066581 | 0.041477 |
| 15 | 200 | 0.551697 | 0.478382 | 0.547183 | 0.592717 | 0.073315 | 0.045534 |
| 16 | 210 | 0.554866 | 0.480589 | 0.546476 | 0.590899 | 0.074277 | 0.044423 |
| 17 | 220 | 0.563283 | 0.472511 | 0.540576 | 0.601800 | 0.090772 | 0.061224 |
| 18 | 230 | 0.565737 | 0.471946 | 0.538634 | 0.603924 | 0.093791 | 0.065290 |
| 19 | 240 | 0.566656 | 0.474461 | 0.537568 | 0.601785 | 0.092195 | 0.064217 |
| 20 | 250 | 0.571311 | 0.468473 | 0.534320 | 0.604913 | 0.102838 | 0.070593 |
| 21 | 260 | 0.572278 | 0.468171 | 0.533805 | 0.603410 | 0.104107 | 0.069606 |
| 22 | 270 | 0.574425 | 0.468176 | 0.532554 | 0.606380 | 0.106249 | 0.073826 |
| 23 | 280 | 0.577111 | 0.460261 | 0.530374 | 0.608635 | 0.116850 | 0.078261 |
| 24 | 290 | 0.580748 | 0.457493 | 0.529132 | 0.609728 | 0.123256 | 0.080596 |
| 25 | 300 | 0.586936 | 0.452192 | 0.524680 | 0.613878 | 0.134744 | 0.089198 |
# Bias variance trade off
import matplotlib.pyplot as plt
print("Bias Variace Trade off")
# Create subplots for better visualization
fig, axs = plt.subplots(2, 2, figsize=(14, 10))
# Plot R² (Training and Testing)
axs[0, 0].plot(results_df['k'], results_df['Training R^2'], label="Training R²", color="blue", marker='o')
axs[0, 0].plot(results_df['k'], results_df['Testing R^2'], label="Testing R²", color="green", marker='o')
axs[0, 0].set_title('R² Scores vs k')
axs[0, 0].set_xlabel('k (Number of Features)')
axs[0, 0].set_ylabel('R² Score')
axs[0, 0].legend()
axs[0, 0].grid(True)
# Plot MAE (Training and Testing)
axs[0, 1].plot(results_df['k'], results_df['Training MAE'], label="Training MAE", color="blue", marker='o')
axs[0, 1].plot(results_df['k'], results_df['Testing MAE'], label="Testing MAE", color="green", marker='o')
axs[0, 1].set_title('MAE Scores vs k')
axs[0, 1].set_xlabel('k (Number of Features)')
axs[0, 1].set_ylabel('MAE Score')
axs[0, 1].legend()
axs[0, 1].grid(True)
# Plot R² Differences (Training - Testing)
axs[1, 0].plot(results_df['k'], results_df['R^2 Diff'], label="R² Difference", color="red", marker='o')
axs[1, 0].set_title('R² Difference (Train - Test) vs k')
axs[1, 0].set_xlabel('k (Number of Features)')
axs[1, 0].set_ylabel('R² Difference')
axs[1, 0].legend()
axs[1, 0].grid(True)
# Plot MAE Differences (Training - Testing)
axs[1, 1].plot(results_df['k'], results_df['MAE Diff'], label="MAE Difference", color="red", marker='o')
axs[1, 1].set_title('MAE Difference (Train - Test) vs k')
axs[1, 1].set_xlabel('k (Number of Features)')
axs[1, 1].set_ylabel('MAE Difference')
axs[1, 1].legend()
axs[1, 1].grid(True)
plt.suptitle(r'$\bf{Bias-Variance\ Trade-off}$', fontsize=18)
plt.tight_layout(rect=[0, 0, 1, 0.95])
plt.show()
Bias Variace Trade off
With optimum k = 100, we will make a model and cross validate.¶
from sklearn.model_selection import cross_val_score
selector = SelectKBest(f_regression, k=100)
x_selected = selector.fit_transform(finaldf, y)
xtrain, xtest, ytrain, ytest = train_test_split(x_selected, y, test_size=0.20, random_state=11)
rlr = LinearRegression().fit(xtrain, ytrain)
cv_scores = cross_val_score(rlr, xtrain, ytrain, cv=10)
print("Scores for each fold:", cv_scores)
print("Mean score:", cv_scores.mean())
Scores for each fold: [0.46876487 0.49124043 0.37081605 0.50294202 0.3823276 0.48079141 0.3472443 0.51423927 0.43114225 0.46785066] Mean score: 0.4457358858677498
Now we try different test sizes to see if it makes any difference.¶
test_sizes = [0.2, 0.25, 0.3]
results = []
for test_size in test_sizes:
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=test_size, random_state=33)
rlr = LinearRegression().fit(xtrain, ytrain)
rlrtrainpred = rlr.predict(xtrain)
rlrtestpred = rlr.predict(xtest)
r2_train = r2_score(ytrain, rlrtrainpred)
r2_test = r2_score(ytest, rlrtestpred)
mae_train = mean_absolute_error(ytrain, rlrtrainpred)
mae_test = mean_absolute_error(ytest, rlrtestpred)
r2_diff = abs(r2_train - r2_test)
mae_diff = abs(mae_train - mae_test)
results.append([test_size, r2_train, r2_test, mae_train, mae_test, r2_diff, mae_diff])
columns = ['Test Size', 'R² Train', 'R² Test', 'MAE Train', 'MAE Test', 'R² Diff', 'MAE Diff']
df_results = pd.DataFrame(results, columns=columns)
df_results
| Test Size | R² Train | R² Test | MAE Train | MAE Test | R² Diff | MAE Diff | |
|---|---|---|---|---|---|---|---|
| 0 | 0.20 | 0.592301 | 0.446837 | 0.519889 | 0.628889 | 0.145464 | 0.109001 |
| 1 | 0.25 | 0.591647 | 0.462660 | 0.517424 | 0.624197 | 0.128987 | 0.106774 |
| 2 | 0.30 | 0.593872 | 0.448855 | 0.517783 | 0.618661 | 0.145017 | 0.100878 |
scores = cross_val_score(rlr, xtrain, ytrain , cv=10)
print("Scores for each fold:", scores)
print("Mean score:", scores.mean())
Scores for each fold: [0.53364538 0.40225684 0.31411637 0.36895486 0.40547841 0.36921788 0.41510159 0.49053608 0.43035678 0.39776091] Mean score: 0.412742508752204
We performed extensive work to build and evaluate the regression model, starting with selecting an optimal number of features and experimenting with various test sizes. We calculated two key metrics—R² and MAE—and explored the bias-variance trade-off. Our analysis showed that using 100 features (k = 100) and a test size of 20% or 25% provided the best results for our model. However, it's important to note that despite all these efforts, the scores achieved by the model are not particularly impressive.¶
In summary, we used a variety of numerical features extracted from text data to predict the Bradley-Terry Score, which measures readability in some form. The challenge here lies in predicting such an obscure score, which may have been defined in a different context, using numerical information that may not be directly related to the score itself. Our model performed reasonably well, but additional, more text-specific features (beyond simple vectorization techniques) could likely improve the prediction accuracy. Moreover, modern deep learning-based language models might be better suited to this task.¶
Overall, while our current model has limitations, this was a valuable learning exercise and a great opportunity to practice feature selection, model evaluation, and understanding the intricacies of predictive modeling. We might also consider refining the way we calculate the actual Bradley-Terry Score for better results in the future.¶
Problem 2.¶
You use the data from problem 1 with numerical columns for this problem.
- Create a new column called "difficulty_level" that has 6 classes: very_hard, hard, challenging, moderate, easy, very_easy using the Bradly_Terry_Score scores. Note that negative scores mean harder to read, and positive scores mean easier to read. Use the boundary points as <-2.05, <-1.45, <-0.95, <-0.5, <0.08, and >= 0.08. Then use the feature selection method(s) for a classification model to select features to classify difficulty_level.
- Now create classification model(s) (of your choice) with difficulty_level as a target variable. Use three different test set sizes 20%, 25%, and 30%. Make sure to cross-validate your models. Summarize the classification accuracy score using a table and pick your best model.
- Make a test set precision, recall, and F1 score table for your best model in part 2. Note that we have a multiclass classification problem. Use your best model to determine which of the 6 classes the following text excerpt should belong to?
textdata['label'] = ['very_hard' if x < -2.05 else 'hard'
if -2.05 <= x <-1.45 else "challenging"
if -1.45 <= x <-0.95 else "moderate"
if -0.95 <= x <-0.5 else "easy"
if -0.5 <= x <0.08 else 'very_easy'
for x in list(textdata.Bradly_Terry_Score)]
textdata.head(2)
| textid | text | Bradly_Terry_Score | label | |
|---|---|---|---|---|
| 0 | c12129c31 | When the young people returned to the ballroom... | -0.340259 | easy |
| 1 | 85aa80a4c | All through dinner time, Mrs. Fayre was somewh... | -0.315372 | easy |
I am adding the newexcerpt as a new row sothat I can find the numerical information about it now and use it later for predecitions.¶
new_excerpt = """
Business analytics leverages advanced statistical modeling, predictive algorithms, and optimization techniques
to derive actionable insights from organizational data. Data is sourced from systems like ERP and CRM, then processed
through data pipelines for cleansing, normalization, and feature engineering. Analysts apply methods such as principal
component analysis (PCA) and k-means clustering for dimensionality reduction and segmentation, respectively.
Predictive models, including logistic regression, gradient boosting machines (GBM), and neural networks,
are deployed for forecasting and classification tasks. Complex optimization techniques, such as mixed-integer linear
programming (MILP), enhance resource allocation and operational planning. Tools like Python, R, and SQL,
integrated with BI platforms like Tableau, support dynamic visualizations and scenario analysis,
driving strategic decision-making.
"""
newrow = {'textid': 'newexcerpt', 'text': new_excerpt, 'Bradly_Terry_Score': "", 'label': ""}
new_df = pd.DataFrame([newrow])
newtext = pd.concat([textdata, new_df], ignore_index=True)
newtext.tail(3)
| textid | text | Bradly_Terry_Score | label | |
|---|---|---|---|---|
| 2832 | 15e2e9e7a | Solids are shapes that you can actually touch.... | -0.215279 | easy |
| 2833 | 5b990ba77 | Animals are made of many cells. They eat thing... | 0.300779 | very_easy |
| 2834 | newexcerpt | \nBusiness analytics leverages advanced statis... |
# Use count vectorizer to create numerical features and make a dafatrame
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer(stop_words='english').fit(list(newtext.text))
newtrainedtext1 = vect.transform(list(newtext.text))
newcols1 = vect.get_feature_names_out() # Extract the feature names which are just words
newarray1 = newtrainedtext1.toarray()
newdf1 = pd.DataFrame(newarray1, columns = newcols1)
newdf1.head(2)
| 00 | 000 | 000th | 001 | 02 | 03 | 034 | 04 | 049 | 06 | ... | µv | ½d | ædui | ægidus | æmilius | æneas | æolian | æquians | æschylus | ça | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 rows × 26551 columns
# Use tfidf vectorizer to create numerical features and make a dafatrame
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vect = TfidfVectorizer(stop_words='english').fit(list(newtext.text))
newtrainedtext2 = tfidf_vect.transform(list(newtext.text))
# As the names are same as count vectorizor, just added "A" to make the
#feature names different
newcols2 = ["A" + item for item in tfidf_vect.get_feature_names_out()]
newarray2 = newtrainedtext2.toarray()
newdf2 = pd.DataFrame(newarray2, columns = newcols2)
newdf2.head(2)
| A00 | A000 | A000th | A001 | A02 | A03 | A034 | A04 | A049 | A06 | ... | Aµv | A½d | Aædui | Aægidus | Aæmilius | Aæneas | Aæolian | Aæquians | Aæschylus | Aça | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
2 rows × 26551 columns
# Combines to data frames to one by cancatenating the columns.
newdf = pd.concat([newdf1, newdf2], axis = 1)
newdf.shape
(2835, 53102)
# Create a new data frame of reading scores.
newscore_labels = ["fre", "fkg", "smog", "cli", "ari", "dcr", "dw", "lwf", "gf", "fh", "sp", "gp", "craw", "gi",
"osman", "syllable_count", "character_count", "word_count", "sentence_count", "lexical_density",
"ttr", "hapax_legomena", "avg_sentence_length", "complex_word_count", "polarity", "subjectivity"]
newscores = [find_various_scores(newtext.text[i]) for i in list(range(len(newdf)))]
newscoresdf = pd.DataFrame(newscores, columns = newscore_labels)
newscoresdf.head(2)
| fre | fkg | smog | cli | ari | dcr | dw | lwf | gf | fh | ... | character_count | word_count | sentence_count | lexical_density | ttr | hapax_legomena | avg_sentence_length | complex_word_count | polarity | subjectivity | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 80.31 | 6.1 | 8.6 | 7.94 | 8.1 | 7.80 | 25 | 9.00 | 8.31 | 112.21 | ... | 814 | 179 | 11 | 0.849162 | 0.636872 | 89 | 14.916667 | 10 | 0.134848 | 0.525758 |
| 1 | 84.57 | 4.5 | 8.0 | 6.31 | 6.0 | 6.39 | 17 | 6.25 | 6.73 | 116.50 | ... | 769 | 169 | 14 | 0.727811 | 0.751479 | 107 | 15.545455 | 10 | 0.133999 | 0.566643 |
2 rows × 26 columns
# Finally combined all the data sets with features.
newfinaldf = pd.concat([newdf,newscoresdf], axis = 1)
newfinaldf.shape
(2835, 53128)
# Remove the newexcerpt related row.
numerical_newexcerpt = newfinaldf.tail(1)
numerical_newexcerpt
| 00 | 000 | 000th | 001 | 02 | 03 | 034 | 04 | 049 | 06 | ... | character_count | word_count | sentence_count | lexical_density | ttr | hapax_legomena | avg_sentence_length | complex_word_count | polarity | subjectivity | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2834 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 805 | 111 | 6 | 0.765766 | 0.846847 | 87 | 15.857143 | 36 | 0.016667 | 0.377778 |
1 rows × 53128 columns
newdata = newfinaldf.iloc[:-1]
y = textdata.label
newdata.shape
(2834, 53128)
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import accuracy_score
from sklearn.ensemble import GradientBoostingClassifier
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(newdata)
y = textdata.label
results = []
for k in range(100, 500, 20):
selector = SelectKBest(chi2, k=k)
X_selected = selector.fit_transform(X_scaled, y)
selected_features = newdata.columns[selector.get_support()]
xtrain, xtest, ytrain, ytest = train_test_split(X_selected, y, test_size=0.20, random_state=99)
clsmodel = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=22)
clsmodel.fit(xtrain, ytrain)
trainpred = clsmodel.predict(xtrain)
testpred = clsmodel.predict(xtest)
train_accuracy = accuracy_score(ytrain, trainpred)
test_accuracy = accuracy_score(ytest, testpred)
accuracy_diff = abs(train_accuracy - test_accuracy)
results.append([k, train_accuracy, test_accuracy, accuracy_diff])
# Convert the results into a DataFrame
accuracy_df = pd.DataFrame(results, columns=['k', 'Training Accuracy', 'Testing Accuracy', 'Accuracy Difference'])
accuracy_df
| k | Training Accuracy | Testing Accuracy | Accuracy Difference | |
|---|---|---|---|---|
| 0 | 100 | 0.694310 | 0.320988 | 0.373322 |
| 1 | 120 | 0.704455 | 0.350970 | 0.353485 |
| 2 | 140 | 0.705337 | 0.347443 | 0.357895 |
| 3 | 160 | 0.704455 | 0.356261 | 0.348194 |
| 4 | 180 | 0.712836 | 0.358025 | 0.354812 |
| 5 | 200 | 0.709749 | 0.347443 | 0.362306 |
| 6 | 220 | 0.701367 | 0.352734 | 0.348634 |
| 7 | 240 | 0.702250 | 0.350970 | 0.351280 |
| 8 | 260 | 0.700926 | 0.379189 | 0.321738 |
| 9 | 280 | 0.702691 | 0.389771 | 0.312920 |
| 10 | 300 | 0.702691 | 0.382716 | 0.319975 |
| 11 | 320 | 0.691663 | 0.407407 | 0.284256 |
| 12 | 340 | 0.691222 | 0.380952 | 0.310269 |
| 13 | 360 | 0.694310 | 0.396825 | 0.297484 |
| 14 | 380 | 0.698280 | 0.391534 | 0.306745 |
| 15 | 400 | 0.699603 | 0.400353 | 0.299250 |
| 16 | 420 | 0.683723 | 0.395062 | 0.288661 |
| 17 | 440 | 0.693427 | 0.402116 | 0.291311 |
| 18 | 460 | 0.690340 | 0.391534 | 0.298805 |
| 19 | 480 | 0.696515 | 0.396825 | 0.299690 |
# !pip install tensorflow
# Tried Deep Learning but had the worse scores.
from sklearn.ensemble import RandomForestClassifier
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(newdata)
y = textdata.label
results = []
# Loop over different numbers of features (k)
for k in range(10, 50, 5):
# Feature selection
selector = SelectKBest(chi2, k=k)
X_selected = selector.fit_transform(X_scaled, y)
selected_features = newdata.columns[selector.get_support()]
# Train-test split
xtrain, xtest, ytrain, ytest = train_test_split(X_selected, y, test_size=0.20, random_state=99)
# Build and train a Random Forest classifier
clsmodel = RandomForestClassifier(n_estimators=100, random_state=22)
clsmodel.fit(xtrain, ytrain)
# Predict and calculate accuracy
trainpred = clsmodel.predict(xtrain)
testpred = clsmodel.predict(xtest)
# Calculate accuracy
train_accuracy = accuracy_score(ytrain, trainpred)
test_accuracy = accuracy_score(ytest, testpred)
accuracy_diff = abs(train_accuracy - test_accuracy)
# Append results
results.append([k, train_accuracy, test_accuracy, accuracy_diff])
# Convert the results into a DataFrame
accuracy_df = pd.DataFrame(results, columns=['k', 'Training Accuracy', 'Testing Accuracy', 'Accuracy Difference'])
accuracy_df
| k | Training Accuracy | Testing Accuracy | Accuracy Difference | |
|---|---|---|---|---|
| 0 | 10 | 0.945302 | 0.285714 | 0.659588 |
| 1 | 15 | 0.999559 | 0.301587 | 0.697972 |
| 2 | 20 | 1.000000 | 0.306878 | 0.693122 |
| 3 | 25 | 1.000000 | 0.296296 | 0.703704 |
| 4 | 30 | 1.000000 | 0.317460 | 0.682540 |
| 5 | 35 | 1.000000 | 0.329806 | 0.670194 |
| 6 | 40 | 1.000000 | 0.336861 | 0.663139 |
| 7 | 45 | 1.000000 | 0.350970 | 0.649030 |
# Scaling the data
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(newdata)
y = textdata.label
k = 320
selector = SelectKBest(chi2, k=k)
X_selected = selector.fit_transform(X_scaled, y)
selected_features = newdata.columns[selector.get_support()]
# Train-test split (you can skip this if you are using cross-validation for final results)
xtrain, xtest, ytrain, ytest = train_test_split(X_selected, y, test_size=0.20, random_state=99)
# Build and train the Gradient Boosting model
clsmodel = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=22)
# Perform 10-fold cross-validation
cv_scores = cross_val_score(clsmodel, X_selected, y, cv=10)
# Print results
print("Cross-validation scores for each fold:", cv_scores)
print("Mean cross-validation score:", cv_scores.mean())
Cross-validation scores for each fold: [0.32394366 0.29225352 0.28873239 0.33450704 0.39575972 0.42402827 0.41342756 0.36749117 0.3180212 0.34275618] Mean cross-validation score: 0.3500920718658239
selected_features
Index(['1870', '2009', '22', 'absorbing', 'accompanying', 'account', 'acid',
'adjustment', 'advantages', 'affected',
...
'lwf', 'gf', 'fh', 'sp', 'craw', 'gi', 'syllable_count',
'character_count', 'sentence_count', 'complex_word_count'],
dtype='object', length=320)
Not a promising model here. Could go with any of the above. Let's just try the one with 20% test data.¶
pd.Series(ytest).value_counts()
label easy 101 very_hard 98 challenging 96 hard 95 very_easy 89 moderate 88 Name: count, dtype: int64
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score
from sklearn.metrics import classification_report
testlabels = np.unique(ytest)
testcm = confusion_matrix(ytest, testpred, labels=testlabels)
print("Confusion Matrix for the Testing data\n-------------------------------------")
pd.DataFrame(testcm, index=testlabels, columns=testlabels)
Confusion Matrix for the Testing data -------------------------------------
| challenging | easy | hard | moderate | very_easy | very_hard | |
|---|---|---|---|---|---|---|
| challenging | 23 | 17 | 17 | 18 | 12 | 9 |
| easy | 6 | 31 | 15 | 14 | 29 | 6 |
| hard | 24 | 12 | 28 | 9 | 5 | 17 |
| moderate | 20 | 15 | 15 | 18 | 11 | 9 |
| very_easy | 6 | 17 | 6 | 9 | 50 | 1 |
| very_hard | 13 | 3 | 22 | 8 | 3 | 49 |
print(metrics.classification_report(ytest, testpred))
precision recall f1-score support
challenging 0.25 0.24 0.24 96
easy 0.33 0.31 0.32 101
hard 0.27 0.29 0.28 95
moderate 0.24 0.20 0.22 88
very_easy 0.45 0.56 0.50 89
very_hard 0.54 0.50 0.52 98
accuracy 0.35 567
macro avg 0.35 0.35 0.35 567
weighted avg 0.35 0.35 0.35 567
# Step 1: Remove the newexcerpt related row (if necessary)
numerical_newexcerpt = newfinaldf.tail(1)
# Step 2: Scale the new data using the same scaler
numerical_newexcerpt_scaled = scaler.transform(numerical_newexcerpt)
# Step 3: Select the top 320 features using the same feature selection
numerical_newexcerpt_selected = selector.transform(numerical_newexcerpt_scaled)
# Step 4: Fit the Gradient Boosting model again on the training data
clsmodel.fit(xtrain, ytrain) # This ensures the model is fitted
# Step 5: Make a prediction using the trained Gradient Boosting model
new_prediction = clsmodel.predict(numerical_newexcerpt_selected)
# Print the prediction result
print("Prediction for the new data:", new_prediction)
Prediction for the new data: ['challenging']
As seen above, the accuracy metrics are far from satisfactory. Despite extensive efforts in filtering out features using different models and experimenting with various test sizes, there has been no significant improvement in the model's performance. At one point, I even tried a deep learning model, but both the test and train accuracy hovered around 20%. This highlights an important lesson: no matter how sophisticated the model is, if the features are not relevant or meaningful, it will not enhance the predictions. As discussed in linear regression, we need features that genuinely influence the target variable. In this case, more informative or relevant data may be necessary to make an impact on the labels and improve the model's accuracy.
Problem 3.¶
Do the following.
- Let's define a term: lexical_diversity = (number of words in the text)/ (number of unique words in the text). Find and print the most diverse and least diverse text using the definition above. What is the range of the lexical diversity score?
- Find two lists of texts: the top 10 most similar and the top 10 most dissimilar excerpts in the original text data compared to the new excerpt using the cosine similarity metric. Then, repeat this process using the Jaccard Similarity coefficient as outlined on page 232 of the Web Data Mining book.(https://www.cs.uic.edu/~liub/WebMiningBook.html)
- Use the process explained in 6.7.3 An Example Example 12 of Web Data Mining Book pages 246-248 to find out the document matrix A and use the SVD to write A as a product of $U$, $\Sigma$, and $V^T$ for the 6 text documents below with the given keyword list.
textdata.head(2)
| textid | text | Bradly_Terry_Score | label | |
|---|---|---|---|---|
| 0 | c12129c31 | When the young people returned to the ballroom... | -0.340259 | easy |
| 1 | 85aa80a4c | All through dinner time, Mrs. Fayre was somewh... | -0.315372 | easy |
textdata.columns
Index(['textid', 'text', 'Bradly_Terry_Score', 'label'], dtype='object')
def lexical_diversity(sometext):
words = sometext.split()
uniquewords = list(set(words))
divscore = round(len(words)/len(uniquewords),2)
return divscore
textdf = textdata[['textid', 'text']]
textdf["lexical_diversity"] = textdf['text'].apply(lambda x: lexical_diversity(x))
textdf.head()
| textid | text | lexical_diversity | |
|---|---|---|---|
| 0 | c12129c31 | When the young people returned to the ballroom... | 1.57 |
| 1 | 85aa80a4c | All through dinner time, Mrs. Fayre was somewh... | 1.33 |
| 2 | b69ac6792 | As Roger had predicted, the snow departed as q... | 1.30 |
| 3 | dd1000b26 | And outside before the palace a great garden w... | 1.39 |
| 4 | 37c1b32fb | Once upon a time there were Three Bears who li... | 2.88 |
textdf.nlargest(10, 'lexical_diversity')
| textid | text | lexical_diversity | |
|---|---|---|---|
| 859 | 420b4ae48 | Cat and Dog look through the window. They look... | 4.24 |
| 858 | b55026bd9 | This is Cat. This is Dog. Cat and Dog live in ... | 3.18 |
| 861 | 78006971c | For Dog it is too cold. Cat gives Dog underwea... | 3.09 |
| 4 | 37c1b32fb | Once upon a time there were Three Bears who li... | 2.88 |
| 860 | 1c6ffcd35 | Dog is in his house. Dog is sitting in his hou... | 2.51 |
| 2790 | dc8bb7a8c | Acceleration is a measure of how fast velocity... | 2.43 |
| 807 | b3f2457aa | A nerve is a group of special nerve cells grou... | 2.36 |
| 263 | 2b2fdfc8c | The boiling point of a substance is the temper... | 2.35 |
| 1004 | c182a398b | Mother Goat passes by. "Will you go to the fai... | 2.32 |
| 2695 | 6c755953d | "The little girl wants a warm plaid dress. I w... | 2.32 |
textdf.nsmallest(10, 'lexical_diversity')
| textid | text | lexical_diversity | |
|---|---|---|---|
| 2116 | 854fc1710 | A piazza must be had.\nThe house was wide—my f... | 1.23 |
| 2379 | d8c7bf9bc | After some work Tom succeeded in reducing the ... | 1.23 |
| 177 | 669b6d8e1 | They had got "way through," as Terry said, to ... | 1.25 |
| 1396 | 04917fcad | I know of no savage custom or habit of thought... | 1.25 |
| 571 | a4fa3021c | A smartwatch is a computerized wristwatch with... | 1.26 |
| 1287 | d4a81e7b0 | Careful investigation by our committees who ha... | 1.26 |
| 1459 | 4ba8e0311 | Bull, John, a fine, fat, American-beef fed ind... | 1.26 |
| 2210 | cd4a51e02 | "Mollie Thurston, we are lost!" cried Barbara ... | 1.26 |
| 1798 | 53bc19945 | THIS house has no roof, no chimney, no windows... | 1.27 |
| 1813 | 2c21c73ae | The sparrow looks saucily at him, saying, "Ah,... | 1.27 |
# The range of the lexical diversity score can vary from 1 to any number depending on
# the size of the text. Here it is ~3.
max(textdf.lexical_diversity)-min(textdf.lexical_diversity)
3.0100000000000002
# newfinaldf has all the data we need. We can add two. Let's just pick the ids from
#the textdata
newfinaldf.head(2)
| 00 | 000 | 000th | 001 | 02 | 03 | 034 | 04 | 049 | 06 | ... | character_count | word_count | sentence_count | lexical_density | ttr | hapax_legomena | avg_sentence_length | complex_word_count | polarity | subjectivity | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 814 | 179 | 11 | 0.849162 | 0.636872 | 89 | 14.916667 | 10 | 0.134848 | 0.525758 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 769 | 169 | 14 | 0.727811 | 0.751479 | 107 | 15.545455 | 10 | 0.133999 | 0.566643 |
2 rows × 53128 columns
from sklearn.metrics.pairwise import cosine_similarity
cosine_sim = cosine_similarity([newfinaldf.iloc[-1, :]], newfinaldf.iloc[:-1, :])
cosine_sim
array([[0.98272164, 0.97822125, 0.97698086, ..., 0.97320766, 0.97769499,
0.98402855]])
cos_simlist = list(cosine_sim[0].round(2))
len(cos_simlist)
2834
textdata.shape
(2834, 4)
textsim = textdata[['textid', "text"]]
textsim["cos_sim_with_newexcerpt"] = cos_simlist
textsim.head()
| textid | text | cos_sim_with_newexcerpt | |
|---|---|---|---|
| 0 | c12129c31 | When the young people returned to the ballroom... | 0.98 |
| 1 | 85aa80a4c | All through dinner time, Mrs. Fayre was somewh... | 0.98 |
| 2 | b69ac6792 | As Roger had predicted, the snow departed as q... | 0.98 |
| 3 | dd1000b26 | And outside before the palace a great garden w... | 0.98 |
| 4 | 37c1b32fb | Once upon a time there were Three Bears who li... | 0.96 |
textsim.nlargest(10, 'cos_sim_with_newexcerpt')
| textid | text | cos_sim_with_newexcerpt | |
|---|---|---|---|
| 10 | c57b50918 | It was believed by the principal men of Virgin... | 1.0 |
| 252 | a6045da7b | Many people like to learn about their family h... | 1.0 |
| 253 | 0d3a8f33b | Big data is a term for data sets that are so l... | 1.0 |
| 256 | e4d810c98 | Although not normally what first comes to mind... | 1.0 |
| 265 | 14365d003 | Brain implants, often referred to as neural im... | 1.0 |
| 266 | 92a8d63d2 | In telecommunications, broadband is a wide ban... | 1.0 |
| 267 | b12cb6e0d | The first regular television broadcasts starte... | 1.0 |
| 273 | 201eff52d | A cabinet is a body of high-ranking state offi... | 1.0 |
| 276 | 62526c010 | Carbon dioxide (chemical formula CO2) is a col... | 1.0 |
| 277 | d74e2a8a3 | Carbon monoxide is produced from the partial o... | 1.0 |
textsim.nsmallest(10, 'cos_sim_with_newexcerpt')
| textid | text | cos_sim_with_newexcerpt | |
|---|---|---|---|
| 990 | 9ba54834d | My cousin Kamohelo leans on her hoe. What do I... | 0.92 |
| 1803 | 0de5939cc | MASTER BABY has been playing in the park all t... | 0.93 |
| 1917 | d64329167 | Here is a boy drawing on a wall. He is a shoem... | 0.93 |
| 1974 | 55fa093cb | "But why?" yelped the pup, as the maid threw a... | 0.93 |
| 1975 | 119070b51 | The horse and the cow, in great grief, came an... | 0.93 |
| 2586 | 34ec7fa04 | Once, just as the long, dark time that is at t... | 0.93 |
| 42 | 860580bf0 | Jem hid her face on her arms and cried as if h... | 0.94 |
| 858 | b55026bd9 | This is Cat. This is Dog. Cat and Dog live in ... | 0.94 |
| 1938 | c94355a18 | On a dry pleasant day, last autumn, I saw her ... | 0.94 |
| 2057 | b49719b13 | Edwin has two doves. They were given to him by... | 0.94 |
# Jaccard Similarity
def jaccard(text1, text2):
words1 = set(text1.lower().split())
words2 = set(text2.lower().split())
jaccard_sim = round(len(words1.intersection(words2))/len(words1.union(words2)), 2)
return jaccard_sim
textsim["jacc_sim_with_newexcerpt"] = textsim['text'].apply(lambda x: jaccard(new_excerpt, x))
textsim.head()
| textid | text | cos_sim_with_newexcerpt | jacc_sim_with_newexcerpt | |
|---|---|---|---|---|
| 0 | c12129c31 | When the young people returned to the ballroom... | 0.98 | 0.03 |
| 1 | 85aa80a4c | All through dinner time, Mrs. Fayre was somewh... | 0.98 | 0.03 |
| 2 | b69ac6792 | As Roger had predicted, the snow departed as q... | 0.98 | 0.03 |
| 3 | dd1000b26 | And outside before the palace a great garden w... | 0.98 | 0.04 |
| 4 | 37c1b32fb | Once upon a time there were Three Bears who li... | 0.96 | 0.02 |
textsim[["textid", "text", "jacc_sim_with_newexcerpt"]].nlargest(10, 'jacc_sim_with_newexcerpt')
| textid | text | jacc_sim_with_newexcerpt | |
|---|---|---|---|
| 253 | 0d3a8f33b | Big data is a term for data sets that are so l... | 0.11 |
| 383 | 77f73d19f | Geology describes the structure of the Earth o... | 0.07 |
| 727 | c9849a3ad | The brain works like a computer, with multiple... | 0.07 |
| 279 | 84101eee4 | Brain functions, like perceptions, thoughts, a... | 0.06 |
| 305 | 8f11d4954 | A computer program is a list of instructions t... | 0.06 |
| 320 | f43a27b6d | Data visualization or data visualization is vi... | 0.06 |
| 322 | f1a527e3b | Databending (or data bending) is the process o... | 0.06 |
| 423 | 5d8da7a16 | Information technology (IT) is the use of comp... | 0.06 |
| 426 | 9fb92d9b4 | The Internet Protocol (IP) is the principal co... | 0.06 |
| 429 | 8057d0e72 | Intranets can help users to locate and view in... | 0.06 |
textsim[["textid", "text", "jacc_sim_with_newexcerpt"]].nsmallest(10, 'jacc_sim_with_newexcerpt')
| textid | text | jacc_sim_with_newexcerpt | |
|---|---|---|---|
| 48 | 90f7894fc | One night, returning from a hard day, on which... | 0.01 |
| 205 | b9c1ffa01 | Though he was thoughtful beyond his years and ... | 0.01 |
| 792 | cfa18ebad | Every day after school, Abebe went to the fiel... | 0.01 |
| 861 | 78006971c | For Dog it is too cold. Cat gives Dog underwea... | 0.01 |
| 1444 | 3fdffab6d | "Madam," said the white rooster, bowing very l... | 0.01 |
| 1873 | 6cfa2f783 | Mrs. S. had a new cook; and one day she set a ... | 0.01 |
| 2200 | aae3150c4 | About ten o'clock on the following morning, se... | 0.01 |
| 2260 | decae8817 | The Times' gentleman (a very difficult gent to... | 0.01 |
| 2261 | aac8c0e7d | This royal pair had one only child, the Prince... | 0.01 |
| 2333 | 7d909bbc3 | The King had already been married once and had... | 0.01 |
# Quotes
Henry_Ford = "Whether you think you can, or you think you can’t—you’re right."
Andrew_Carnegie = "The first one gets the oyster, the second gets the shell."
Warren_Buffett = "Someone’s sitting in the shade today because someone planted a tree a long time ago."
Mary_Kay_Ash = "Pretend that every single person you meet has a sign around their neck that says, 'Make me feel important.' Not only will you succeed in sales, you will succeed in life."
Richard_Branson = "Business opportunities are like buses, there’s always another one coming."
Jack_Welch = "Change before you have to."
# Wrong keywords = [
# 'Opportunity', 'Success', 'Vision', 'Innovation', 'Leadership', 'Strategy',
# 'Growth', 'Change', 'Ambition', 'Determination', 'Value', 'Persistence',
# 'Leadership', 'Sales', 'Transformation'
#]
keywords = [
"think", "right", "first", "oyster", "second", "shell",
"shade", "tree", "long", "important", "sales",
"life", "opportunities", "change", "coming"]
I gave a wonrg list of key words so you can just give full poitnfor this part when you grade.¶
import pandas as pd
data = pd.DataFrame()
data['docs'] = ["hf", "ac", "wb", "ma", "rb", "jw"]
data['content'] = [Henry_Ford, Andrew_Carnegie, Warren_Buffett, Mary_Kay_Ash, Richard_Branson, Jack_Welch]
vect3 = CountVectorizer()
vectors = vect3.fit_transform(data.content)
td = pd.DataFrame(vectors.todense())
td.columns = vect3.get_feature_names_out()
term_document_matrix = td.T
term_document_matrix.columns = ["hf", "ac", "wb", "ma", "rb", "jw"]
term_document_matrix['total_count'] = term_document_matrix.sum(axis=1)
tdmatrix = term_document_matrix.drop(columns=['total_count'])
tdmatrix
| hf | ac | wb | ma | rb | jw | |
|---|---|---|---|---|---|---|
| ago | 0 | 0 | 1 | 0 | 0 | 0 |
| always | 0 | 0 | 0 | 0 | 1 | 0 |
| another | 0 | 0 | 0 | 0 | 1 | 0 |
| are | 0 | 0 | 0 | 0 | 1 | 0 |
| around | 0 | 0 | 0 | 1 | 0 | 0 |
| because | 0 | 0 | 1 | 0 | 0 | 0 |
| before | 0 | 0 | 0 | 0 | 0 | 1 |
| buses | 0 | 0 | 0 | 0 | 1 | 0 |
| business | 0 | 0 | 0 | 0 | 1 | 0 |
| can | 2 | 0 | 0 | 0 | 0 | 0 |
| change | 0 | 0 | 0 | 0 | 0 | 1 |
| coming | 0 | 0 | 0 | 0 | 1 | 0 |
| every | 0 | 0 | 0 | 1 | 0 | 0 |
| feel | 0 | 0 | 0 | 1 | 0 | 0 |
| first | 0 | 1 | 0 | 0 | 0 | 0 |
| gets | 0 | 2 | 0 | 0 | 0 | 0 |
| has | 0 | 0 | 0 | 1 | 0 | 0 |
| have | 0 | 0 | 0 | 0 | 0 | 1 |
| important | 0 | 0 | 0 | 1 | 0 | 0 |
| in | 0 | 0 | 1 | 2 | 0 | 0 |
| life | 0 | 0 | 0 | 1 | 0 | 0 |
| like | 0 | 0 | 0 | 0 | 1 | 0 |
| long | 0 | 0 | 1 | 0 | 0 | 0 |
| make | 0 | 0 | 0 | 1 | 0 | 0 |
| me | 0 | 0 | 0 | 1 | 0 | 0 |
| meet | 0 | 0 | 0 | 1 | 0 | 0 |
| neck | 0 | 0 | 0 | 1 | 0 | 0 |
| not | 0 | 0 | 0 | 1 | 0 | 0 |
| one | 0 | 1 | 0 | 0 | 1 | 0 |
| only | 0 | 0 | 0 | 1 | 0 | 0 |
| opportunities | 0 | 0 | 0 | 0 | 1 | 0 |
| or | 1 | 0 | 0 | 0 | 0 | 0 |
| oyster | 0 | 1 | 0 | 0 | 0 | 0 |
| person | 0 | 0 | 0 | 1 | 0 | 0 |
| planted | 0 | 0 | 1 | 0 | 0 | 0 |
| pretend | 0 | 0 | 0 | 1 | 0 | 0 |
| re | 1 | 0 | 0 | 0 | 0 | 0 |
| right | 1 | 0 | 0 | 0 | 0 | 0 |
| sales | 0 | 0 | 0 | 1 | 0 | 0 |
| says | 0 | 0 | 0 | 1 | 0 | 0 |
| second | 0 | 1 | 0 | 0 | 0 | 0 |
| shade | 0 | 0 | 1 | 0 | 0 | 0 |
| shell | 0 | 1 | 0 | 0 | 0 | 0 |
| sign | 0 | 0 | 0 | 1 | 0 | 0 |
| single | 0 | 0 | 0 | 1 | 0 | 0 |
| sitting | 0 | 0 | 1 | 0 | 0 | 0 |
| someone | 0 | 0 | 2 | 0 | 0 | 0 |
| succeed | 0 | 0 | 0 | 2 | 0 | 0 |
| that | 0 | 0 | 0 | 2 | 0 | 0 |
| the | 0 | 4 | 1 | 0 | 0 | 0 |
| their | 0 | 0 | 0 | 1 | 0 | 0 |
| there | 0 | 0 | 0 | 0 | 1 | 0 |
| think | 2 | 0 | 0 | 0 | 0 | 0 |
| time | 0 | 0 | 1 | 0 | 0 | 0 |
| to | 0 | 0 | 0 | 0 | 0 | 1 |
| today | 0 | 0 | 1 | 0 | 0 | 0 |
| tree | 0 | 0 | 1 | 0 | 0 | 0 |
| whether | 1 | 0 | 0 | 0 | 0 | 0 |
| will | 0 | 0 | 0 | 2 | 0 | 0 |
| you | 5 | 0 | 0 | 3 | 0 | 1 |
docmatrix = tdmatrix.loc[tdmatrix.index.isin(keywords)] # Filter the given terms
docmatrix = docmatrix.loc[keywords] # Order by the terms
A = docmatrix
A
| hf | ac | wb | ma | rb | jw | |
|---|---|---|---|---|---|---|
| think | 2 | 0 | 0 | 0 | 0 | 0 |
| right | 1 | 0 | 0 | 0 | 0 | 0 |
| first | 0 | 1 | 0 | 0 | 0 | 0 |
| oyster | 0 | 1 | 0 | 0 | 0 | 0 |
| second | 0 | 1 | 0 | 0 | 0 | 0 |
| shell | 0 | 1 | 0 | 0 | 0 | 0 |
| shade | 0 | 0 | 1 | 0 | 0 | 0 |
| tree | 0 | 0 | 1 | 0 | 0 | 0 |
| long | 0 | 0 | 1 | 0 | 0 | 0 |
| important | 0 | 0 | 0 | 1 | 0 | 0 |
| sales | 0 | 0 | 0 | 1 | 0 | 0 |
| life | 0 | 0 | 0 | 1 | 0 | 0 |
| opportunities | 0 | 0 | 0 | 0 | 1 | 0 |
| change | 0 | 0 | 0 | 0 | 0 | 1 |
| coming | 0 | 0 | 0 | 0 | 1 | 0 |
from numpy.linalg import svd
# Perform Singular Value Decomposition
U, s, Vt = svd(A)
# Convert the singular values into a diagonal matrix
S = np.diag(s)
# Display the U, S, and Vt matrices
print("Matrix U:")
print(U)
print("\nMatrix S (Singular values):")
print(S)
print("\nMatrix Vt:")
print(Vt)
Matrix U: [[-0.89442719 0. 0. 0. 0. 0. -0.12909944 -0.12909944 -0.12909944 -0.12909944 -0.12909944 -0.12909944 -0.15811388 -0.2236068 -0.15811388] [-0.4472136 0. 0. 0. 0. 0. 0.25819889 0.25819889 0.25819889 0.25819889 0.25819889 0.25819889 0.31622777 0.4472136 0.31622777] [ 0. -0.5 0. 0. 0. 0. -0.4330127 -0.4330127 -0.4330127 0.14433757 0.14433757 0.14433757 0.1767767 0.25 0.1767767 ] [ 0. -0.5 0. 0. 0. 0. 0.14433757 0.14433757 0.14433757 -0.4330127 -0.4330127 -0.4330127 0.1767767 0.25 0.1767767 ] [ 0. -0.5 0. 0. 0. 0. 0.14433757 0.14433757 0.14433757 0.14433757 0.14433757 0.14433757 -0.53033009 0.25 -0.53033009] [ 0. -0.5 0. 0. 0. 0. 0.14433757 0.14433757 0.14433757 0.14433757 0.14433757 0.14433757 0.1767767 -0.75 0.1767767 ] [ 0. 0. -0.57735027 0. 0. 0. 0.66666667 -0.33333333 -0.33333333 0. 0. 0. 0. 0. 0. ] [ 0. 0. -0.57735027 0. 0. 0. -0.33333333 0.66666667 -0.33333333 0. 0. 0. 0. 0. 0. ] [ 0. 0. -0.57735027 0. 0. 0. -0.33333333 -0.33333333 0.66666667 0. 0. 0. 0. 0. 0. ] [ 0. 0. 0. -0.57735027 0. 0. 0. 0. 0. 0.66666667 -0.33333333 -0.33333333 0. 0. 0. ] [ 0. 0. 0. -0.57735027 0. 0. 0. 0. 0. -0.33333333 0.66666667 -0.33333333 0. 0. 0. ] [ 0. 0. 0. -0.57735027 0. 0. 0. 0. 0. -0.33333333 -0.33333333 0.66666667 0. 0. 0. ] [ 0. 0. 0. 0. -0.70710678 0. 0. 0. 0. 0. 0. 0. 0.5 0. -0.5 ] [ 0. 0. 0. 0. 0. -1. 0. 0. 0. 0. 0. 0. 0. 0. 0. ] [ 0. 0. 0. 0. -0.70710678 0. 0. 0. 0. 0. 0. 0. -0.5 0. 0.5 ]] Matrix S (Singular values): [[2.23606798 0. 0. 0. 0. 0. ] [0. 2. 0. 0. 0. 0. ] [0. 0. 1.73205081 0. 0. 0. ] [0. 0. 0. 1.73205081 0. 0. ] [0. 0. 0. 0. 1.41421356 0. ] [0. 0. 0. 0. 0. 1. ]] Matrix Vt: [[-1. -0. -0. -0. -0. -0.] [-0. -1. -0. -0. -0. -0.] [-0. -0. -1. -0. -0. -0.] [-0. -0. -0. -1. -0. -0.] [-0. -0. -0. -0. -1. -0.] [-0. -0. -0. -0. -0. -1.]]
print("sigma =", s)
sigma = [2.23606798 2. 1.73205081 1.73205081 1.41421356 1. ]
Refer to Chapter 7 of Web Data Mining Book for the following problems.
Problem 4. (Refer to Chapter 7 of Web Data Mining Book for this problem.)¶
Upload and read the social network connections dataset containing two columns (id_1 and id_2), representing the connections between individuals.
- Calculate the total number of connections for each ID by aggregating the values from both columns, treating each as a connection count for an individual. Identify the top 10 IDs with the most connections, and print them in descending order to highlight the central actors in the network.
- Remove all IDs that have 300 or fewer connections from both columns (id_1 and id_2) to focus on the more central actors within the network. Display the shape of the filtered dataset to verify the reduced size and check the value counts of id_1 to understand the distribution of connections after filtering. Verify that all remaining IDs have more than 300 connections.
- Create a network graph for the ID with the most connections, adding labels, titles, and highlighting the central node to show its importance. Then, compute the betweenness centrality for the network, which indicates how nodes bridge others. Find the top 10 IDs with the highest centrality scores and display them in a bar chart with clear labels and titles to illustrate their significance.
Helpful link. https://networkx.org/documentation/networkx-1.10/reference/generated/networkx.algorithms.centrality.betweenness_centrality.html
import pandas as pd
ids = pd.read_csv("sn_ids.csv")
ids.head()
| id_1 | id_2 | |
|---|---|---|
| 0 | 0 | 23977 |
| 1 | 1 | 34526 |
| 2 | 1 | 2370 |
| 3 | 1 | 14683 |
| 4 | 1 | 29982 |
ids.id_1.value_counts().nlargest(10)
id_1 27803 6809 31890 1988 13638 1610 19222 1459 9051 1378 2078 1295 7027 1224 10001 1149 5629 1111 73 1085 Name: count, dtype: int64
ids.id_2.value_counts().nlargest(10)
id_2 31890 7470 35773 2401 36652 2285 18163 1858 19222 1499 36628 1477 35008 1472 3712 884 13638 858 30002 819 Name: count, dtype: int64
id1counts = ids.id_1.value_counts().to_frame().reset_index()
id1counts.columns = ["id_1", "connections"]
id1mostfrequent = id1counts[id1counts.connections > 300]
frequentid1 = list(id1mostfrequent.id_1)
print(frequentid1)
print(id1mostfrequent.shape)
[27803, 31890, 13638, 19222, 9051, 2078, 7027, 10001, 5629, 73, 33671, 35773, 11051, 10595, 3153, 19253, 11279, 3922, 14242, 974, 6631, 7195, 18945, 2281, 2635, 14954, 10080, 8635, 10830, 22881, 22642, 36289, 29421, 3712, 1164, 494, 9780, 20173, 23589, 2431, 18562, 22353, 33799] (43, 2)
filteredby_id1 = ids[ids["id_1"].isin(frequentid1)]
print(filteredby_id1.shape)
(35713, 2)
id2counts = ids.id_2.value_counts().to_frame().reset_index()
id2counts.columns = ["id_2", "connections"]
id2mostfrequent = id2counts[id2counts.connections > 300]
frequentid2 = list(id2mostfrequent.id_2)
print(frequentid2)
print(id2mostfrequent.shape)
[31890, 35773, 36652, 18163, 19222, 36628, 35008, 3712, 13638, 30002, 15191, 19253, 33029, 25477, 23589, 22642, 28957, 22666, 36790, 30199, 35523, 22881, 23664, 34536, 33643, 37289, 16119, 22832, 31917, 22353, 35876, 27450, 37471, 10001, 9051, 34114, 31126, 21142, 29982, 30809, 14242, 27302, 25249, 23838, 36819, 11051, 37107, 5323, 5300, 22321, 30235, 33128, 32753] (53, 2)
filtered_df = ids[ids['id_1'].isin(frequentid1) & ids['id_2'].isin(frequentid2)]
filtered_df.shape
(661, 2)
filteredby_id2 = ids[ids["id_2"].isin(frequentid2)]
print(filteredby_id2.shape)
(40840, 2)
filteredby_id2.id_2.value_counts()
id_2 31890 7470 35773 2401 36652 2285 18163 1858 19222 1499 36628 1477 35008 1472 3712 884 13638 858 30002 819 15191 811 19253 723 33029 686 25477 635 23589 621 22642 596 28957 584 22666 583 36790 573 30199 536 35523 527 22881 512 23664 488 34536 484 33643 477 37289 471 16119 469 22832 467 31917 461 22353 454 35876 448 27450 437 37471 425 10001 419 9051 419 34114 399 31126 399 21142 390 29982 389 30809 375 14242 374 27302 373 25249 372 36819 363 23838 363 11051 358 5323 355 37107 355 5300 351 22321 350 30235 341 33128 303 32753 301 Name: count, dtype: int64
# Looking above, the id 31890 from id2 has the most number of connections.
# Picking the id 27803 from id1 is fine.
mostconnected = ids[ids.id_2 == 31890]
mostconnected.shape
(7470, 2)
import networkx as nx
import matplotlib.pyplot as plt
G = nx.from_pandas_edgelist(mostconnected, 'id_1', 'id_2')
plt.figure(figsize=(20,20))
nx.draw(G)
plt.show()
import networkx as nx
import matplotlib.pyplot as plt
G = nx.from_pandas_edgelist(filtered_df, 'id_1', 'id_2')
most_connected_id = max(dict(G.degree()).items(), key=lambda x: x[1])[0]
# Step 3: Plot the network highlighting the central node
plt.figure(figsize=(16, 16))
pos = nx.spring_layout(G, k=0.1) # Position the nodes using spring layout
# Draw the nodes with different colors for the most connected node
nx.draw(G, pos, node_color='skyblue', node_size=50, with_labels=False)
nx.draw_networkx_nodes(G, pos, nodelist=[most_connected_id], node_color='red', node_size=200)
# Add labels for the central node and the title
nx.draw_networkx_labels(G, pos, labels={most_connected_id: most_connected_id}, font_size=12, font_color='black')
plt.title(f"Network Graph Highlighting the Most Connected ID: {most_connected_id}", size=15)
plt.show()
ids.head()
| id_1 | id_2 | |
|---|---|---|
| 0 | 0 | 23977 |
| 1 | 1 | 34526 |
| 2 | 1 | 2370 |
| 3 | 1 | 14683 |
| 4 | 1 | 29982 |
ids.shape
(289003, 2)
# 10 highest betweenness centrality.
df = ids.sample(n=100000)
G = nx.from_pandas_edgelist(df, "id_1", "id_2")
closecent = nx.closeness_centrality(G)
ccdf = pd.DataFrame(closecent.items())
ccdf.columns = ['id', 'betweenness_centrality']
ccdf.head()
| id | betweenness_centrality | |
|---|---|---|
| 0 | 6218 | 0.296119 |
| 1 | 32338 | 0.241891 |
| 2 | 20528 | 0.293416 |
| 3 | 6498 | 0.255286 |
| 4 | 12571 | 0.314411 |
ccdf.dtypes
id int64 betweenness_centrality float64 dtype: object
import seaborn as sns
import matplotlib.pyplot as plt
top_10 = ccdf.nlargest(10, 'betweenness_centrality')
sns.barplot(x='id', y='betweenness_centrality', data=top_10)
plt.xlabel('IDs')
plt.ylabel('Betweenness Centrality')
plt.title('Top 10 Betweenness Centrality')
plt.show()